Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


tools proposed an extra step before taxonomic assignment. This step is meant to reduce
potential errors that introduce noise into the data; hence, it is called denoising.

7.2.2.2 Denoising

Two types of errors can arise when deciding whether the variation within an OTU
represents errors or real diversity. The first type is the base-calling error, which may
arise during sequencing. Errors of this type may stem from incorrect base pairing during
PCR amplification, from polymerase slippage, or from PCR chimeras, which form when DNA
strand extension is aborted during PCR and the aborted products act as primers in the
next PCR cycle, producing artifacts. The second type is the misclassification of a read
into an incorrect taxonomic group. This error can be reduced by constructing OTUs at a
particular similarity threshold, such as 97%, but that may come at the cost of taxonomic
sensitivity. Denoising attempts to handle both kinds of error by using the reads
themselves to infer the correct biological sequences, so that misclassification is avoided.
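The sensitivity cost of a fixed threshold can be seen in a minimal Python sketch. It computes the percent identity between two equal-length sequences and checks it against a 97% cutoff. The sequences and the helper name `percent_identity` are hypothetical and chosen only for illustration; real OTU-picking tools align the reads first rather than comparing positions directly.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length (already aligned)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical 100-nt sequences: the variant differs from the reference at one position.
reference = "ACGT" * 25
substitute = "A" if reference[50] != "A" else "C"
variant = reference[:50] + substitute + reference[51:]

identity = percent_identity(reference, variant)

# At a 97% threshold the two sequences fall into the same OTU,
# so the single-nucleotide difference between them is no longer visible.
same_otu = identity >= 97.0
print(f"identity = {identity:.1f}%, same OTU at 97%: {same_otu}")
```

Under these assumptions the two sequences are 99% identical and are clustered together, whether the difference is a sequencing error or a real variant. This is the ambiguity that denoising tries to resolve at single-nucleotide resolution.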

Several computational approaches have been proposed for sequence denoising. The most
commonly used are DADA2, Deblur, and UNOISE3, which can infer error-free biological
sequences at single-nucleotide resolution. The inferred sequences, which are then used
for taxonomic assignment, are variously called features, zero-radius OTUs (ZOTUs),
exact sequence variants (ESVs), or amplicon sequence variants (ASVs). In the following,
we discuss these three popular denoising methods.

7.2.2.2.1 DADA2 Denoising

DADA2 (Divisive Amplicon Denoising Algorithm 2) [8] was adapted for use with Illumina
sequencing and is available as an open-source R package and as a plugin in QIIME2, which
is an open-source command-line Linux program. DADA2 implements a new model of
Illumina-sequenced amplicon errors that incorporates quality information for both
within-sequence and between-sequence errors. The model quantifies the error rate
($\lambda_{ji}$) at which an amplicon read with sequence i is produced from a sample
sequence j as a function of sequence composition and quality. The number of amplicon
reads with sequence i, or its abundance, follows a Poisson distribution whose expected
value equals the error rate $\lambda_{ji}$ multiplied by the expected number of reads of
sample sequence j.
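The Poisson abundance model described above can be illustrated with a minimal Python sketch. The values of `lam_ji` (the error rate for the pair i, j) and `n_j` (the expected reads of sample sequence j) are hypothetical numbers chosen for illustration, not estimates from real data.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


lam_ji = 1e-4   # hypothetical error rate: read i produced from sample sequence j
n_j = 20_000    # hypothetical expected number of reads of sample sequence j

# Expected abundance of sequence i if it arises only as an error of j.
expected_abundance = lam_ji * n_j

# Probability of observing 10 or more such reads under the Poisson model.
p_at_least_10 = 1.0 - sum(poisson_pmf(k, expected_abundance) for k in range(10))

print(f"expected abundance: {expected_abundance:.2f}")
print(f"P(abundance >= 10): {p_at_least_10:.2e}")
```

With these assumed values, only about two error-derived reads are expected, so an observed abundance of ten or more would be very unlikely under the error model alone.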

The DADA2 model assumes that errors occur independently within a read and independently
between reads. The model then estimates the error rate as the product, over the L aligned
nucleotides, of the transition probabilities given the quality score associated with each
position, as follows:

$$\lambda_{ji} = \prod_{l=1}^{L} p\big(j(l) \rightarrow i(l),\, q_i(l)\big) \qquad (7.1)$$
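The product in Equation 7.1 can be sketched in Python as follows. The function `transition_prob` is a hypothetical stand-in that derives the per-position probability from the Phred quality score and splits the error probability evenly over the three possible substitutions. This is an assumption made for illustration and is not DADA2's fitted error model. The sequences and quality scores are likewise made up.

```python
import math


def transition_prob(src: str, dst: str, q: int) -> float:
    """Hypothetical p(src -> dst, q), based on the Phred error probability 10^(-q/10)."""
    p_err = 10 ** (-q / 10)
    # No change keeps probability 1 - p_err; each substitution gets an equal share.
    return 1.0 - p_err if src == dst else p_err / 3.0


def error_rate(sample_seq: str, read_seq: str, quals: list[int]) -> float:
    """Product of per-position transition probabilities over the aligned positions."""
    if not (len(sample_seq) == len(read_seq) == len(quals)):
        raise ValueError("sequences and quality scores must have the same length")
    # Sum logarithms instead of multiplying to avoid underflow on long reads.
    log_rate = sum(
        math.log(transition_prob(j, i, q))
        for j, i, q in zip(sample_seq, read_seq, quals)
    )
    return math.exp(log_rate)


sample_j = "ACGTACGTAC"   # hypothetical sample sequence j
read_i = "ACGTACGAAC"     # hypothetical read i with one mismatch (T -> A)
quals = [30] * 10         # hypothetical quality scores of the read

lam_ji = error_rate(sample_j, read_i, quals)    # read differs from the sample sequence
lam_jj = error_rate(sample_j, sample_j, quals)  # read identical to the sample sequence

print(f"lambda_ji = {lam_ji:.3e}, lambda_jj = {lam_jj:.3f}")
```

Because the positions are treated as independent, a single mismatch at a high-quality position lowers the rate by several orders of magnitude relative to an exact match.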